Pre-requisites
Dataframe names:
Install and load packages:
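The original package chunk is not shown; a minimal sketch, assuming the tidyverse family plus lubridate and ggmap (all of which are referenced later in the report):

```r
# Assumed package set -- adjust to the packages actually used in the report
pkgs <- c("tidyverse", "lubridate", "ggmap")

# Install any packages that are missing, then load them all
missing <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing) > 0) install.packages(missing)
invisible(lapply(pkgs, library, character.only = TRUE))
```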
Datasets
Please download the following data files from the [link](https://divvy-tripdata.s3.amazonaws.com/index.html) and copy them into the “data” folder of your RStudio working directory.
Data set: ‘202201-divvy-tripdata.csv’ (data for one month, Jan. 2022)
Data set: ‘202202-divvy-tripdata.csv’ (data for one month, Feb. 2022)
Data set: ‘202203-divvy-tripdata.csv’ (data for one month, Mar. 2022)
Data set: ‘202204-divvy-tripdata.csv’ (data for one month, Apr. 2022)
Data set: ‘202205-divvy-tripdata.csv’ (data for one month, May 2022)
Data set: ‘202206-divvy-tripdata.csv’ (data for one month, Jun. 2022)
Data set: ‘202207-divvy-tripdata.csv’ (data for one month, Jul. 2022)
Data set: ‘202208-divvy-tripdata.csv’ (data for one month, Aug. 2022)
Data set: ‘202209-divvy-tripdata.csv’ (data for one month, Sep. 2022)
Data set: ‘202210-divvy-tripdata.csv’ (data for one month, Oct. 2022)
Data set: ‘202211-divvy-tripdata.csv’ (data for one month, Nov. 2022)
Data set: ‘202212-divvy-tripdata.csv’ (data for one month, Dec. 2022)
Reading the 12 CSV files from local storage into separate dataframes:
Next we will merge all 12 dataframes into one single dataframe containing all the raw data.
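A sketch of the read-and-merge step, assuming the file names shown above sit in a local `data/` folder (`df_raw` is an assumed dataframe name):

```r
library(readr)   # read_csv
library(dplyr)   # bind_rows

# Build the 12 monthly file names: data/202201-...csv through data/202212-...csv
files <- sprintf("data/2022%02d-divvy-tripdata.csv", 1:12)

# Read each existing file into its own dataframe, then stack them row-wise
df_raw <- files[file.exists(files)] |>
  lapply(read_csv, show_col_types = FALSE) |>
  bind_rows()
```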
Result:
Result:
Result:
Result:
Result:
Create a dataframe for stations:
Plot the station locations on a map:
There are stations far from the Chicago area, probably testing stations.
We will exclude these stations and set the boundary of the greater Chicago area (GCA) to:
The following stations are identified as outliers:
11 stations are outside of the greater Chicago area. They will be removed in the later cleaning process.
There are many rows with missing values, related to missing recordings of the start and end stations. All rows with NAs will be removed.
Remove rows with missing values:
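A minimal illustration of the step with toy data (the real dataframe and column names may differ):

```r
# Toy stand-in for the merged trip data: two rows have missing station names
df_raw <- data.frame(
  start_station_name = c("A", NA,  "B"),
  end_station_name   = c("X", "Y", NA)
)

# complete.cases() keeps only rows without any NA
df_clean <- df_raw[complete.cases(df_raw), ]
nrow(df_clean)   # 1
```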
Result:
Service stations used for maintenance or charging tests should be removed from the dataset. Our first guess was that they could be identified by their long station-ID names. However, it turned out that some docking stations with charging devices for electric bikes have long ID names as well. After reviewing the 1600 station IDs, we identified the following service-station names:
Result:
Result:
In order to detect and remove duplicates we will compare all columns except the unique ride_id:
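A sketch of the de-duplication idea with toy data: rows are compared on every column except `ride_id`, and only the first occurrence of each combination is kept.

```r
# Toy rides: r1 and r2 agree on everything except their ride_id
rides <- data.frame(
  ride_id       = c("r1", "r2", "r3"),
  started_at    = c("2022-01-01 08:00", "2022-01-01 08:00", "2022-01-01 09:00"),
  start_station = c("A", "A", "B")
)

# duplicated() flags repeats over all columns except the unique ride_id
dup     <- duplicated(rides[, setdiff(names(rides), "ride_id")])
deduped <- rides[!dup, ]
nrow(deduped)   # 2
```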
Result:
The imported time-stamps for start and end time are recorded in local time (“America/Chicago”, Central Time). In order to account for daylight saving time (DST), the data and the RStudio system environment have to be set to the same local time-zone (TZ).
The time-zone of the imported raw data is set by default to TZ = “UTC”. Since the time-stamps were recorded in local time, we have to change the time-zone of both the data and the system environment.
Note: A fixed standard time-zone such as CST (Central Standard Time) does not observe daylight saving time, and using it would result in wrong ride-time calculations.
Set the system environment TZ:
Setting the system time to local time “America/Chicago”:
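A sketch of setting the TZ environment variable for the running R session:

```r
# Point the session's TZ environment variable at Chicago local time;
# all subsequent time formatting and DST handling will use this zone
Sys.setenv(TZ = "America/Chicago")
Sys.getenv("TZ")   # "America/Chicago"
```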
Result: System time-zone set to “America/Chicago”
Set the data TZ:
Forcing TZ “America/Chicago” on time-stamps:
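lubridate’s `force_tz()` does exactly this: it keeps the clock time unchanged and only relabels the time-zone. A sketch with one example time-stamp:

```r
library(lubridate)

# A time-stamp imported with the default TZ = "UTC"
t_utc <- as.POSIXct("2022-06-01 08:15:00", tz = "UTC")

# Relabel the zone without shifting the clock time
t_chi <- force_tz(t_utc, tzone = "America/Chicago")
format(t_chi, "%H:%M:%S")   # still "08:15:00", but now in Chicago time
```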
Result: Data time-zone set to “America/Chicago”
Spring: Advancing the clock - time gap
In spring, on 2022-03-13 at 02:00:00, the clock is advanced by 1 hour. There is no time recording for 02:00:00 - 02:59:59, i.e. 01:59:59 + 1 sec = 03:00:00. A naive ride-time calculation by simple clock subtraction would create an error of +60 min. The calculation is automatically corrected by the function difftime()
Checking the correct calculation for rides during the time gap:
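The check can be reproduced with a single ride spanning the gap:

```r
# One second before and the first second after the spring-forward gap
start <- as.POSIXct("2022-03-13 01:59:59", tz = "America/Chicago")
end   <- as.POSIXct("2022-03-13 03:00:00", tz = "America/Chicago")

# difftime() works on absolute time, so the skipped hour is not counted
difftime(end, start, units = "secs")   # 1 sec, not 60 min + 1 sec
```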
Result: Calculation is correct. Time gap of 60 minutes skipped in
calculation
Fall: Setting the clock back - time overlap
In autumn, on 2022-11-06 at 02:00:00, the clock is set back by 1 hour. The times between 01:00:00 and 01:59:59 are recorded twice. To eliminate the ambiguity, the time would have had to be recorded in UTC (e.g. by a GPS tracker). In our data set, however, the time is recorded in local time. Thus we cannot determine whether a time-stamp belongs to the first pass (before the DST switch) or the second pass (after it). Therefore, we will remove all rides that started or ended between 1 am and 2 am on that date. Rides that started before 1 am and ended after 2 am are automatically corrected by the function difftime()
Number of time-stamps in the DST overlap slot:
Result: 341 rows are affected
Removing rides in the DST overlap slot:
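A sketch of the removal with toy data (`started_at` / `ended_at` are assumed column names):

```r
# Two toy rides on fall-back day; the second one starts and ends
# in the ambiguous 01:00-01:59 slot and must be dropped
rides <- data.frame(
  started_at = as.POSIXct(c("2022-11-06 00:30:00", "2022-11-06 01:30:00"),
                          tz = "America/Chicago"),
  ended_at   = as.POSIXct(c("2022-11-06 00:45:00", "2022-11-06 01:45:00"),
                          tz = "America/Chicago")
)

# TRUE when a time-stamp falls into the doubled 1 am hour on 2022-11-06
in_overlap <- function(t) {
  format(t, "%Y-%m-%d %H", tz = "America/Chicago") == "2022-11-06 01"
}

rides_clean <- subset(rides, !in_overlap(started_at) & !in_overlap(ended_at))
nrow(rides_clean)   # 1
```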
Result:
There are still some rides where the end time is before the start time.
Result: There are 37 rides with negative ride time. All rides
originated and ended at the same station
Remove rides where the end time precedes the start time:
Result:
For further analysis we will now calculate the ride time (duration) and extract year, month_name, day_name and hour from the start time.
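A sketch of these derived columns, using lubridate accessors on one toy ride (column names are assumptions):

```r
library(lubridate)

rides <- data.frame(
  started_at = as.POSIXct("2022-07-04 17:05:00", tz = "America/Chicago"),
  ended_at   = as.POSIXct("2022-07-04 17:35:00", tz = "America/Chicago")
)

rides$ride_time  <- as.numeric(difftime(rides$ended_at, rides$started_at,
                                        units = "mins"))   # duration in minutes
rides$year       <- year(rides$started_at)
rides$month_name <- month(rides$started_at, label = TRUE)  # "Jul"
rides$day_name   <- wday(rides$started_at, label = TRUE)   # "Mon"
rides$hour       <- hour(rides$started_at)                 # 17
```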
We will now convert categorical variables into factors with defined, ordered levels. This will allow us to produce correctly ordered graphics:
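The idea in miniature, using the built-in `month.abb` vector as the ordered level set:

```r
# Character months sort alphabetically -- useless for plotting
m <- c("Feb", "Jan", "Mar")

# As an ordered factor they sort (and plot) chronologically
m_fct <- factor(m, levels = month.abb, ordered = TRUE)
as.character(sort(m_fct))   # "Jan" "Feb" "Mar"
```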
We will now add sub-groups for season, week and day defined as follows:
First let’s explore the distribution of ride_time by
user group and bike type to identify possible outliers.
Result:
To get more insight into the data we will use histograms for each group:
Result: apparently only the docked_bike type shows outliers. Let’s zoom in to ride times up to 1,500 minutes:
Result:
We don’t see evidence for outliers in the classic-bike and electric-bike groups, due to their continuous distributions. However, in the case of docked bikes we see a sudden change in the pattern, a kind of noise starting from 1,500 minutes.
Therefore, the upper threshold to remove outliers will be set to 1,500 minutes.
We decided to set the lower threshold to 1 minute. Rides of less than 1 minute can be classified as bike management: docking and releasing bikes in order to reset the rental status. For more details see the DIVVY pricing program in chapter IV (Appendix).
Now, after we have defined the boundary values [1, 1500] let’s remove the outliers:
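The filter itself is a one-liner; a toy illustration (assuming `ride_time` is in minutes):

```r
# Toy ride times in minutes; 0.5 and 2000 fall outside [1, 1500]
rides <- data.frame(ride_time = c(0.5, 12, 300, 2000))

rides_clean <- subset(rides, ride_time >= 1 & ride_time <= 1500)
nrow(rides_clean)   # 2
```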
Result:
The following variables will be removed from the dataframe, as they are not essential for the further analysis.
NOTE: the ordered-factor settings would be lost in a CSV file. Therefore, we will create an RDS file that can be used in the R Markdown file for the analysis.
Pre-requisites
Global settings for scale and color:
For better readability we will give the long dataframe a short name:
Dataframe names:
Let’s first get an overview of the ride-time distribution in each group by histograms:
Result:
Next let’s calculate for each user group the totals, mean, standard deviation and other statistical measures:
Graph: Total number of rides, total ride-time and average ride-time by user group. In a skewed distribution the average (mean) is not very meaningful; instead we will use the median:
Result:
For better understanding of the ride-time distribution we will use a
boxplot:
How to read the boxplot:
Results for casuals:
Results for members:
Since the data are highly skewed, a statistical test like the t-test is not meaningful. The data can, however, be normalized using a log transformation.
Preparing a dataset with log-trans values:
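A sketch of the transformation (ride times are strictly positive after the cleaning above, so the log is always defined):

```r
# Heavily right-skewed toy ride times (minutes)
ride_time <- c(2, 5, 10, 30, 120, 600)

# Natural log compresses the long right tail into a roughly symmetric shape
log_ride <- log(ride_time)
range(log_ride)
```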
Histogram of log-trans values:
Result: the distribution is now normalized
Boxplot of log-trans values:
Result: The boxplots are now symmetric, i.e. normalized; mean and median are almost identical.
T-test to test if the difference in mean between
groups is statistically significant:
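A sketch with simulated log-transformed ride times (group names and effect size are made up for illustration):

```r
set.seed(42)

# Simulated log ride-times for the two user groups
log_casual <- rnorm(500, mean = 3.0, sd = 0.8)
log_member <- rnorm(500, mean = 2.6, sd = 0.8)

# Welch two-sample t-test on the (now roughly normal) log values
res <- t.test(log_casual, log_member)
res$p.value < 0.05   # the difference in means is significant for this toy data
```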
Result:
Let’s first get an overview of the whole year cycle by number of rides aggregated by day:
Result:
Let’s break down the signal by month, day and hour:
We will plot the number of rides over the year, aggregated by month, in absolute and in relative values.
The relative frequency per group is calculated as:
frq_rides = n_month / n_year
Results:
Average ride-time by month:
Results:
Now, let’s plot the number of rides over a week, aggregated by day, in absolute and in relative values.
The relative frequency per group is calculated as:
frq_rides = n_day / n_year
Results:
Average ride-time by day:
Results:
Finally, let’s plot the number of rides over a day, aggregated by hour, in absolute and in relative values.
The relative frequency per group is calculated as:
frq_rides = n_hour / n_year
Results:
Average ride-time by hour:
Results:
Year:
Week:
Day:
We will now break down the analysis further by bike type (classic bike and electric bike) to gain insight into whether there are significant differences between the user groups:
Results:
Docked bikes are used only by casuals, so we exclude this type from the further analysis.
Total rides per bike type aggregated by month, for weekend and workday:
Results:
Total rides of electric bikes aggregated by hour for different seasons, on weekends and workdays:
Results:
Total rides of classic bikes aggregated by hour for different seasons, on weekends and workdays:
Results:
Next, let’s display the top 20 stations for casuals and
members.
For the map visualization we used ggmap and geom_point. The map was pulled from Google Maps; a Google Maps API key is required. We will skip the code here, since the map-creation code is quite lengthy. If there is any interest, please let me know.
The following maps were created with ggmap and the Google Maps API:
Results:
A density map visualizes the number of rides accumulated over the full time span and mapped to the location of the start stations. The range is sliced into 50 levels; each level is assigned a color and bounded by a polygon.
For the visualization we used ggmap and the stat_density2d function. The map was pulled from Google Maps; a Google Maps API key is required. We will skip the code here, since the map-creation code is quite lengthy. If there is any interest, please let me know.
Results:
Relationships (correlations) between casuals and members can best be visualized in scatter plots. We therefore plot the number of rides of casuals on the x-axis and the number of rides of members on the y-axis.
If the groups are strongly correlated, then we can infer that their behavior is not different.
If the groups are only weakly correlated, we can infer that a difference in behavior exists.
We will then use a simple linear regression model provided by geom_smooth(method = “lm”). The coefficients of the model are not accessible from geom_smooth. However, we can calculate the R-value with a simple function that provides a measure of correlation (+/- 1 for high correlation, 0 for no correlation). The regression slope values are read off the graphs manually.
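Both numbers can also be computed directly, so they need not be read off the graph; a sketch with simulated data (`n_casual` / `n_member` are assumed names):

```r
set.seed(1)

# Simulated hourly ride counts: members roughly twice the casuals plus noise
n_casual <- rpois(100, lambda = 50)
n_member <- 2 * n_casual + rpois(100, lambda = 10)

r     <- cor(n_casual, n_member)                     # Pearson R-value
slope <- coef(lm(n_member ~ n_casual))[["n_casual"]] # regression slope
round(c(r = r, slope = slope), 2)
```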
We will create a new dataframe df_corr to aggregate the number of rides by hour. This reduces the data to about 8,700 rows.
Result:
To find further clues we will look into the correlation by season, weekday and hour.
Now let’s split the rides by season. The seasons are defined as:
Result:
Next let’s split the rides by day type. The day types are defined as:
Result:
Finally let’s split the rides by time of day. The day times are defined as:
Results:
The relationship of ride numbers between casual and member users was analyzed with simple linear regression models segmented by season, weekday and hour. All models were significant (very small p-values).
The models confirm our findings from the time series in chapter 3.2:
How do annual members and casual riders use Cyclistic bikes
differently?
Overall:
By months:
By week:
By day:
By bike type:
By location:
Scenario 1: Introduction of Seasonal Memberships
Scenario 2: Introduction of a Benefit Program for Electric Bikes
Scenario 3: Convert Casual Riders to Annual Memberships
Thank you for your interest!